Contents

0.0.1 Abstract

An critical task in genomic data analyses for stratified medicine is class discovery which is accomplished through clustering. Consensus clustering has emerged as a very popular and effective machine learning approach for this task (Monti et al., 2001; Șenbabaoğlu et al., 2014). The algorithm aims to solve class number decision using a resampling method that produces a consensus rate between samples (a measure of stability). However, a critical problem with this, and other clustering approaches, is it does not use a reference datset nor test the null hypothesis. This was the basis for the development of Monte Carlo Consensus Clustering (M3C). M3C use a multi-enabled monte carlo simulation to generate a distribution of PAC (proportion of unambiguous clustering) scores from a null dataset with the same gene-gene correlation structure as the real, for deriving A) the PAC statistic and B) an empirical p value. These two metrics substantially improve accuracy, allow rejection of the null hypothesis K = 1, and provide confidence in the form of a p value for different clustering solutions. This approach deals with a number of pitfalls in current clustering approaches and thus provides a suitable framework for hypothesis testing.

0.0.2 Prerequisites

M3C recommended spec:

A relatively new and fast multi-core computer or cluster.

M3C requires:

A matrix or data frame of normalised expression data (e.g. microarray or RNA-seq) where columns equal samples and rows equal features.

The data should be filtered to remove features with no or very low signal, and filtered for variance to reduce dimensionality (unsupervised), or p value from a statistical test (supervised).

M3C also accepts optionally:

Annotation data frame, where every row is a patient/sample, columns refer to meta-data e.g. age, sex. M3C will automatically rearrange your annotation to match the clustering output and add the consensus cluster to it. Note, this only works if the IDs (column names in data) match a column called “ID” in the annotation data frame.

0.0.3 Example workflow

The M3C package contains the GBM cancer microarray dataset for testing. There is an accepted cluster solution of 4. First we load the package which also loads the GBM data.

0.0.4 Running M3C

Next, we run the algorithm with 100x monte carlo iterations and 100x inner replications for the real data and reference. The iterations parameter can be increased to 1000 later for a final analysis. Plots from the tool and an .csv file with the numerical outputs are printed into the working directory (printres = T). We will set the seed in this example, incase you wish to repeat our results exactly. We will add the annotation file for easier plotting later on (des = desx).

We always recommend saving the workspace after M3C if running higher numbers of iterations, because the runtimes can be quite long. We have removed hierarchical clustering and k means from this algorithm, because they performed poorly in our simulations or too slow relative to PAM. M3C uses PAM with Euclidean distance. The reference method that generates the null datasets is best left to the default setting which is ‘reverse-PCA’ that maintains the correlation structure of the data using the principle component loadings.

The scores and p values are contained within the res$scores object. So we can see the PAC_STAT reaches a maxima at K = 4 (pac_stat=0.33), the monte carlo p value supports this decision (p=0.05). This means we can reject the null hypothesis that K = 1 for this dataset because we have achieved significance versus a dataset with no clusters, but with identical gene-gene correlation structure. For p values that can extend beyond the lower limits imposed by the monte carlo simulation we use estimated parameters from the simulation to generate a beta distribution, the BETA_P in this case study is 0.031. We may of course want to take a look at other significant or near significant clustering solutions and how they relate to our variables of investigation.

res$scores
##    K  PAC_REAL   PAC_REF        RCSI MONTECARLO_P     BETA_P   P_SCORE
## 1  2 0.6734694 0.5773796 -0.15394262   1.31372549 0.69721054 0.1566361
## 2  3 0.4473469 0.5410857  0.19024326   0.37254902 0.19923640 0.7006313
## 3  4 0.3404082 0.4711755  0.32508528   0.07843137 0.03313314 1.4797374
## 4  5 0.3200000 0.4018041  0.22764361   0.17647059 0.07848793 1.1051971
## 5  6 0.3102041 0.3474204  0.11330519   0.47058824 0.22889997 0.6403543
## 6  7 0.2840816 0.3031673  0.06502332   0.62745098 0.33674564 0.4726980
## 7  8 0.2653061 0.2700408  0.01768878   0.90196078 0.46514316 0.3324134
## 8  9 0.2465306 0.2413469 -0.02125070   1.11764706 0.56967334 0.2443741
## 9 10 0.2130612 0.2190531  0.02773443   0.88235294 0.44629390 0.3503790

Next, we will take a look at some of the plots M3C generates.

This is a CDF plot of the GBM data we feed into the algorithm. We are looking for the value of K with the flattest curve and this can be quantified using the PAC metric (Șenbabaoğlu et al., 2014). Note, we have removed delta K from the algorithm because as we have recently shown, it performs badly. In the CDF plot we can see the overfitting effect of consensus clustering where as K increases so does the apparent stability of the solution, this we correct for by using a reference.

Figure 1: CDF plot for real data

This figure below is the PAC score, we can see an elbow at K = 4 which is suggestive this is the best K. However, we are not able to quantify how confident we are in this value without comparison versus a null dataset. Indeed, this may just be random noise.

Figure 2: PAC score for real data

We then derive the PAC statistic which takes into account the reference PAC scores. This metric should be used instead of the PAC score for deciding class number, where the maxima is the number of clusters in the data. In this example it is 4. Note, the biology is of course important as well, so we may want to have a look at K = 5.

Figure 3: PAC Statistic

Finally we calculate a p value from the distribution. This adds a measure of confidence to the PAC statistic and can be used to support arguments there is genuine structure in the data opposed to just noise. If p values do not reach significance it is implying the dataset is more similar to random data. In the GBM dataset we can see K = 4 and K = 5 reach signfiicance with an alpha of 0.05 and K = 4 is the most significant.

Figure 4: Empirical p values

Now we are convinced there are 4 clusters within this dataset which are not likely simply to have occurred by chance alone we can turn to examine the output objects that M3C generates. These facilitate clustering and heatmap rendering for publication.

0.0.5 Understanding M3C outputs

The first 3 lines below extract the expression data and the annotation data from the results object after running M3C. If we wanted to extract the data for 5 clusters, we would simply replace in 4 in the below lines to 5 (and so on). We scale the data here row wise according to z-score prior to some light compression for visualisation purposes in the heatmap. Remember to set Colv = NA for heatmap plotting because we have already ordered the data column or samplewise, M3C does that for you.

# get the data out of the results list (by using $ - dollar sign)
data <- res$realdataresults[[4]]$ordered_data # this is the data
annon <- res$realdataresults[[4]]$ordered_annotation # this is the annotation
ccmatrix <- res$realdataresults[[4]]$consensus_matrix # this is the consensus matrix

# normalise and scale the data
data <- t(scale(t(data))) # z-score normalise each row (feature)
data <- apply(data, 2, function(x) ifelse(x > 4, 4, x)) # compress data within range
data <- apply(data, 2, function(x) ifelse(x < -4, -4, x)) # compress data within range

# get some cool colour palettes from the ggsci package and RColourBrewer
ann_colors <- ggsci::pal_startrek("uniform")(4) # star trek palette
ann_colors2 <- ggsci::pal_futurama()(4) # futurama palette
pal <- rev(colorRampPalette(RColorBrewer::brewer.pal(10, "RdBu"))(256))
NMF::aheatmap(data, annCol = annon, scale = 'row', Colv = NA, distfun = 'pearson', 
         color = gplots::bluered(256), annColors = list(class=ann_colors, consensuscluster=ann_colors2))

Figure 5: aheatmap of GBM consensus clusters with tumour classification

The last thing we may want to do for publications is print the consensus matrix for our optimal clustering solution (K = 4), this should be quite crisp reflecting the flat distribution of the CDF. Note, we could have printed all the consensus heatmaps into the current directory using ‘printheatmaps = T’, however, in this case we will show how to extract the matrix, cluster and render it with the aheatmap function from the NMF package. We can see in this heatmap of the consensus matrix the clusters look quite clear supporting our view that there is 4 clusters.

# time to plot the consensus matrix for the optimal cluster decision
ccmatrix <- res$realdataresults[[4]]$consensus_matrix # pull out the consensus matrix from the k = 4 object
pal <- rev(colorRampPalette(RColorBrewer::brewer.pal(9, "Reds"))(256)) # get some nice colours
NMF::aheatmap(ccmatrix, annCol = annon, Colv = NA, Rowv = NA, 
              color = rev(pal), scale = 'none') # plot the heatmap

Figure 6: aheatmap of GBM consensus matrix

So we have now covered the basic use of M3C. Generally we recommend complexheatmap for rendering of heatmaps as it is more customisable for publications, however, aheatmap (from the NMF package) is also nice and easy to use.

Happy cluster hunting!

We will now turn to an extra function of the package, which can be used for benchmarking clustering tools.

0.0.6 Generating simulated data

We have included a function for generating simulated data as part of the package to test clustering algorithms. This cluster simulator is simple to use. This simulator, using the code below, generates a dataset with 225 samples, 900 features, a radius cut-off for the initial square of 8, a cluster number of 4, a seperation of clusters of 0.75, and a degree of noise added to each co-ordinate of 0.025. After running, a PCA will print of the data so we can visualise the 4 clusters in principle component space.

Figure 7: PCA of a 4 cluster simulated dataset

0.0.7 Conclusions

In this tutorial, we have seen that M3C can provide statistical validation of clustering results as well as using relative scores. Although our monte carlo approach is computationally expensive, this is justified by 3 reasons; 1) increased performance relative to other methods, 2) the ability to reject the null hypothesis, 3) To assess confidence of the results using a p value instead of a relative score. In our studies of real data, we have seen M3C effectively drives class discovery throughout different stratified medicine projects.

0.0.8 References

Monti, Stefano, et al. “Consensus clustering: a resampling-based method for class discovery and visualization of gene expression microarray data.” Machine learning 52.1 (2003): 91-118.

Șenbabaoğlu, Yasin, George Michailidis, and Jun Z. Li. “Critical limitations of consensus clustering in class discovery.” Scientific reports 4 (2014): 6207.